Comments for MEDB 5501, Module01

2014-05-30

Topics to be covered

  • What you will learn
    • About this class
    • R and RStudio
    • History of R
    • Scales of measurement
    • Tests of hypothesis
    • Confidence intervals
    • A simple R program

Welcome to the class

  • Three sections
    • 0001, Synchronous Zoom meetings on Tuesdays
    • 0002, Asynchronous
    • 0003, International students
  • Introductions
    • Ricardo Moniz
    • Suman Sahil

Requirements of all students

  • Attend Tuesday Zoom or watch video of Tuesday Zoom
  • Read book chapter
  • Optional review session on Zoom on Fridays
  • Complete all assignments by Monday at 11:59pm

Attendance

  • Synchronous students
    • MUST attend Tuesday Zoom sessions.
    • can also review the recordings
  • Asynchronous students
    • watch the Tuesday recordings
    • or attend some of the Tuesday Zoom sessions
    • or both
  • Failure to attend is a problem

Assignments

  • Due Mondays at 11:59pm
  • Failure to submit on time is a problem

This class is in transition

  • Taught for many years by Dr. Monica Gaddis
  • My second time teaching this class
  • Major changes
    • Software agnosticism
      • In class switch from SPSS to R
    • Suman Sahil will co-teach and cover programming in R
    • Discussion boards ask for feedback
    • Proposed exam questions

Why R

  • Faculty consensus
    • Prepares you better for capstone/thesis work
    • More likely to be used in your future job
    • Integrates well with team based tools
  • R is not as difficult as claimed
    • Work from existing templates
  • Python, SAS, Stata would have also been good choices
    • You are welcome to use these if you like

Student learning objectives, 1 of 3

  • SLO1
    • The graduate will be able to use statistics to analyze and interpret data. They will understand the fundamentals of the field in the context of recognizing the effective use of data or information for the specific discipline(s). They will select and apply appropriate statistical procedures to the information. They will be able to analyze and accurately interpret statistical results.

Student learning objectives, 2 of 3

  • SLO2
    • The graduate will be able to design a testable research question or hypothesis. They will have adequate background knowledge about biological, biomedical, or population health contexts and problems including common research problems in order to generate a research question or hypothesis. They will be able to relate problems within and across levels of areas of the spectrum to bridge disciplines.

Student learning objectives, 3 of 3

  • SLO5
    • The graduate will be able to communicate scientific outcomes. This includes the ability to convey scientific methods and statistical findings, effectively field questions in an oral presentation format as well as in the preparation of thesis or capstone manuscripts.

Break #1

  • What you have learned
    • About this class
  • What’s coming next
    • R and RStudio

R

RStudio

Other statistical software

  • JMP, Python, R, SAS, SPSS, Stata
    • Use only if you are confident in your abilities
    • Avoid Microsoft Excel

Break #2

  • What you have learned
    • R and RStudio
  • What’s coming next
    • History of R

R sprouted from S

Figure 1. Book cover

John Chambers

Figure 2. Photo of John Chambers

Richard Becker

Figure 3. Photo of Richard Becker

Allan Wilks

Figure 4. Photo of Allan Wilks

Bell Labs

Figure 5. Aerial photograph of Bell Laboratories

Features of S.

  • Intended for internal use.
  • Freely available to anyone.
  • Interactive
  • Unique capabilities
    • Emphasis on functions
    • Object-oriented features

S-plus

Figure 6. Venables and Ripley book cover

Beginnings of R (1/2)

Figure 7. Excerpt from research paper

Beginnings of R (2/2)

Figure 8. CD of release 1.0 of R

Growth in popularity

Figure 9. Excerpt from New York Times article

R Foundation

Figure 10. Excerpt from website

Revolution Analytics

Figure 11. Excerpt from article

R packages

Figure 12. Excerpt from website

Bioconductor

Figure 13. Excerpt from website

BUGS

Figure 14. Excerpt from website

RStudio

Figure 16. Excerpt from website

RMarkdown

Figure 17. Excerpt from website

Recent major contributions: Frank Harrell

Figure 18. Title slide from Frank Harrell talk

Recent major contributions: Hadley Wickham

Figure 19. Title slide from presentation

The tidyverse library

Figure 20. Hex sticker for tidyverse

dplyr

Figure 21. Hex sticker for dplyr

ggplot2

Figure 22. Hex sticker for ggplot2

magrittr

Figure 23. Hex sticker for magrittr

readr

Figure 24. Hex sticker for readr

stringr

Figure 25. Hex sticker for stringr

tibble

Figure 26. Hex sticker for tibble

tidyr

Figure 27. Hex sticker for tidyr

Other packages in the tidyverse

  • In the core package
    • forcats
    • purrr
  • Outside the core package
    • broom
    • lubridate
    • readxl
    • many others

Recent major contributions: Yihui Xie

Figure 28. Excerpt from GitHub site

knitr

Figure 29. Hex sticker for knitr

bookdown

Figure 30. Hex sticker for bookdown

Other works by Yihui Xie

  • blogdown
  • tinytex
  • xaringan

RStudio renamed as Posit

Figure 31. Excerpt from Posit blog

Quarto

Figure 32. Hex sticker for Quarto

Positron

Figure 33. Excerpt from README file

If you want to learn more: Rickert 2014

Figure 34. Excerpt from blog post

If you want to learn more: Chambers 2006

Figure 35. Title slide from presentation

If you want to learn more: Hastie 2014

Figure 36. Excerpt from blog post

If you want to learn more: Ihaka 1998

Figure 37. Excerpt from research paper

If you want to learn more: Becker (no date)

Figure 38. Excerpt from paper

If you want to learn more: Smith 2020

Figure 39. Excerpt from website

Break #3

  • What you have learned
    • History of R
  • What’s coming next
    • Scales of measurement

Scales of measurement

  • Dichotomy
    • Continuous
    • Categorical
  • Stevens scales of measurement (controversial!)
    • Nominal
    • Ordinal
    • Interval
    • Ratio
  • Addition/subtraction not allowed for ordinal data
    • Mean of ordinal data is meaningless

An example of ordinal data.

  • “Do you agree or disagree with the following statements”
    • “I believe that knowledge of Statistics is important for my job.”
      • 1 = Strongly disagree
      • 2 = Disagree
      • 3 = Neutral
      • 4 = Agree
      • 5 = Strongly agree
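
A minimal sketch in R of how a survey item like this could be stored as ordinal data. The seven responses are made up for illustration, and the names `responses`, `likert_labels`, and `agreement` are hypothetical.

```r
# Hypothetical responses to the survey item (made-up data, not from any survey)
responses <- c(4, 5, 3, 4, 2, 5, 4)

# Store the codes as an ordered factor so R treats them as ordinal, not numeric
likert_labels <- c("Strongly disagree", "Disagree", "Neutral",
                   "Agree", "Strongly agree")
agreement <- factor(responses, levels = 1:5, labels = likert_labels,
                    ordered = TRUE)

# Frequency counts are always meaningful for ordinal data
table(agreement)

# The median only needs the ordering; type = 1 returns an observed category
median_category <- quantile(agreement, probs = 0.5, type = 1)
median_category
```

Calling `mean(agreement)` returns `NA` with a warning, because R refuses to do arithmetic on a factor. That matches the point that addition and subtraction are not defined for ordinal data.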

Another example of ordinal data, course grades

  • A = 4
  • B = 3
  • C = 2
  • D = 1
  • F = 0

Break #4

  • What you have learned
    • Scales of measurement
  • What’s coming next
    • Tests of hypothesis

What is a population?

  • Population: a group that you wish to generalize your research results to. It is defined in terms of
    • Demography,
    • Geography,
    • Occupation,
    • Time,
    • Care requirements,
    • Diagnosis,
    • Or some combination of the above.

Example of a population

All infants born in the state of Missouri during the 1995 calendar year who have one or more visits to the Emergency room during their first year of life.

What is a sample?

  • Sample: subset of a population.
  • Random sample: every person has the same probability of being in the sample.
  • Biased sample: Some people have a decreased probability of being in the sample.
    • Always ask “who was left out?”

An example of a biased sample

  • A researcher wants to characterize illicit drug use in teenagers. She distributes a questionnaire to students attending a local public high school.
  • (In the U.S., high school is grades 9-12, which is mostly students from ages 14 to 18.)
  • Explain how this sample is biased.
  • Who has a decreased or even zero probability of being selected?

Type your ideas in the chat box.

Fixing a biased sample

  • Redefine your population
    • Not all teenagers,
      • but those attending public high schools.

What is a parameter?

  • A parameter is a number computed from a population.
    • Examples
      • Average health care cost associated with the 29,637 children
      • Proportion of these 29,637 children who died in their first year of life.
      • Correlation between gestational age and number of ER visits of these 29,637 children.
    • Designated by Greek letters (\(\mu\), \(\pi\), \(\rho\))

What is a statistic?

  • A statistic is a number computed from a sample
    • Examples
      • Average health care cost associated with 100 children.
      • Proportion of these 100 children who died in their first year of life.
      • Correlation between gestational age and number of ER visits of these 100 children.
    • Designated by non-Greek letters (\(\bar{X}\), \(\hat{p}\), r).
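
The difference between a parameter and a statistic can be sketched in R with simulated data. The cost values below are invented (drawn from an arbitrary gamma distribution), and only the population size of 29,637 and the sample size of 100 come from the examples above.

```r
set.seed(5501)  # arbitrary seed so the sketch is reproducible

# Simulated population of 29,637 health care costs (made-up values)
population_cost <- rgamma(29637, shape = 2, scale = 500)

# Parameter: computed from every member of the population
mu <- mean(population_cost)

# Statistic: computed from a random sample of 100 children
sample_cost <- sample(population_cost, size = 100)
x_bar <- mean(sample_cost)

c(parameter = mu, statistic = x_bar)
```

In real research the parameter is unknown, and the statistic is the only number you get to see. Rerunning the last four lines without the seed gives a different statistic each time, while the parameter stays fixed.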

What is Statistics?

  • Statistics
    • The use of information from a sample (a statistic) to make inferences about a population (a parameter)
      • Often a comparison of two populations

What is the null hypothesis?

  • The null hypothesis (\(H_0\)) is a statement about a parameter.
  • It implies no difference, no change, or no relationship.
    • Examples
      • \(H_0:\ \mu_1 - \mu_2 = 0\)
      • \(H_0:\ \pi_1 - \pi_2 = 0\)
      • \(H_0:\ \rho = 0\)

What is the alternative hypothesis?

  • The alternative hypothesis (\(H_1\) or \(H_a\)) implies a difference, change, or relationship.
    • Examples
      • \(H_1:\ \mu_1 - \mu_2 \ne 0\)
      • \(H_1:\ \pi_1 - \pi_2 \ne 0\)
      • \(H_1:\ \rho \ne 0\)

Hypothesis in English instead of Greek

  • Only statisticians like Greek letters
    • Translate to simple text
    • For two group comparisons
      • Safer, more effective
    • For regression models
      • Trend, association

Use PICO

  • P = patient population
  • I = intervention
  • C = control
  • O = outcome

Example of text hypotheses (1/2)

  • “… the objective of this 78-week randomised, placebo-controlled study was to determine whether treatment with nilvadipine sustained-release 8 mg, once a day, was effective and safe in slowing the rate of cognitive decline in patients with mild to moderate Alzheimer disease.”
    • Lawlor B, Segurado R, Kennelly S, et al. Nilvadipine in mild to moderate Alzheimer disease: A randomised controlled trial. PLoS Med. 2018; 15(9): e1002660. DOI: 10.1371/journal.pmed.1002660

PICO for this study

  • P = patients with mild to moderate Alzheimer disease
  • I = Nilvadipine
  • C = placebo
  • O = cognitive function

Example of text hypotheses (2/2)

  • “… we investigated trends in BCC incidence over a span of 20 years and the associations between incident BCC and risk factors in a total population of 140,171 participants from 2 large US-based cohort studies: women in the Nurses’ Health Study (NHS; 1986–2006) and men in the Health Professionals’ Follow-up Study (HPFS; 1988–2006).”
    • Wu S, Han J, Li WQ, Li T, Qureshi AA. Basal-cell carcinoma incidence and associated risk factors in U.S. women and men. Am J Epidemiol. 2013; 178(6): 890–897. DOI: 10.1093/aje/kwt073

PICO for this study

  • P = female nurses/male health professionals
  • I = various risk factors
  • C = absence of various risk factors
  • O = presence/absence of BCC

One-sided alternatives

  • Examples
    • \(H_1:\ \mu_1 - \mu_2 \gt 0\)
    • \(H_1:\ \pi_1 - \pi_2 \gt 0\)
    • \(H_1:\ \rho \gt 0\)
  • Changes in only one direction expected
  • Changes in opposite direction uninteresting

Passive smoking controversy

  • EPA meta-analysis of passive smoking
    • Criticized for using a one-sided hypothesis
    • Samet JM, Burke TA. Turning science into junk: the tobacco industry and passive smoking. Am J Public Health. 2001;91(11):1742–1744.

What is a decision rule? (1/3)

  • Example
    • \(H_0:\ \mu_1 - \mu_2 = 0\)
    • \(H_1:\ \mu_1 - \mu_2 \ne 0\)
    • t = (\(\bar{X}_1-\bar{X}_2\)) / se
    • Accept \(H_0\) if t is close to zero.
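
A rough sketch of this test statistic in R, using simulated data. The slide does not say which standard error to use, so the Welch (unequal variance) form is an assumption here, chosen because it is what `t.test()` uses by default.

```r
set.seed(5501)  # arbitrary seed so the sketch is reproducible

# Two simulated groups (made-up data), both drawn with the same true mean
group1 <- rnorm(30, mean = 100, sd = 15)
group2 <- rnorm(30, mean = 100, sd = 15)

# Standard error of the difference in means (Welch form, unequal variances)
se <- sqrt(var(group1) / length(group1) + var(group2) / length(group2))

# t = (difference in sample means) / se
t_by_hand <- (mean(group1) - mean(group2)) / se

# t.test() uses the Welch form by default, so the two values should agree
t_from_r <- unname(t.test(group1, group2)$statistic)
c(by_hand = t_by_hand, t_test = t_from_r)
```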

What is a decision rule? (2/3)

  • Example
    • \(H_0:\ \pi_1 - \pi_2 = 0\)
    • \(H_1:\ \pi_1 - \pi_2 \ne 0\)
    • t = (\(\hat{p}_1-\hat{p}_2\)) / se
    • Accept \(H_0\) if t is close to zero.
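
The same ratio for two proportions might be computed as follows. The counts are hypothetical, and the pooled standard error is one common choice that the slide does not specify. For proportions this ratio is often written z rather than t.

```r
# Hypothetical counts (made up): events out of patients in each group
events <- c(18, 30)
n      <- c(100, 100)

p_hat  <- events / n
p_pool <- sum(events) / sum(n)   # pooled proportion under H0

# Standard error of the difference, assuming H0 is true
se <- sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))

z_by_hand <- (p_hat[1] - p_hat[2]) / se

# prop.test() reports a chi-squared statistic; without the continuity
# correction it equals the square of the ratio computed above
chisq_from_r <- unname(prop.test(events, n, correct = FALSE)$statistic)
c(z_squared = z_by_hand^2, prop_test = chisq_from_r)
```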

What is a decision rule? (3/3)

  • Example
    • \(H_0:\ \rho = 0\)
    • \(H_1:\ \rho \ne 0\)
    • t = r / se
    • Accept \(H_0\) if t is close to zero.
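
For a correlation, the standard error under the null hypothesis is usually taken as \(\sqrt{(1-r^2)/(n-2)}\). A short sketch with simulated data shows that this reproduces the statistic reported by `cor.test()`.

```r
set.seed(5501)  # arbitrary seed so the sketch is reproducible

# Simulated paired data (made up); x and y are generated independently
x <- rnorm(40)
y <- rnorm(40)
n <- length(x)

r  <- cor(x, y)
se <- sqrt((1 - r^2) / (n - 2))   # standard error of r under H0: rho = 0

t_by_hand <- r / se
t_from_r  <- unname(cor.test(x, y)$statistic)
c(by_hand = t_by_hand, cor_test = t_from_r)
```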

What is a Type I error?

  • A Type I error is rejecting the null hypothesis when the null hypothesis is true
    • False positive
    • Example involving drug approval: a Type I error is allowing an ineffective drug onto the market.
  • \(\alpha\) = P[Type I error]

What is a Type II error?

  • A Type II error is accepting the null hypothesis when the null hypothesis is false.
    • False negative result
    • Usually computed at MCD
    • An example involving drug approval: a Type II error is keeping an effective drug off of the market.
  • \(\beta\) = P[Type II error]
  • Power = \(1-\beta\)
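
Power for a two-group comparison of means can be sketched with `power.t.test()` from base R. Every planning value below (sample size, difference, standard deviation) is hypothetical and chosen only for illustration.

```r
# Assumed planning values (all hypothetical)
result <- power.t.test(
  n = 50,            # patients per group
  delta = 5,         # smallest difference worth detecting
  sd = 10,           # assumed standard deviation
  sig.level = 0.05)  # alpha = P[Type I error]

power_est <- result$power
beta      <- 1 - power_est   # P[Type II error]
c(power = power_est, beta = beta)
```

Leaving out `n` and supplying `power = 0.9` instead asks the same function for the sample size needed to reach 90% power.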

What is a p-value?

  • Let t =
    • (\(\bar{X}_1-\bar{X}_2\)) / se, or
    • (\(\hat{p}_1-\hat{p}_2\)) / se, or
    • r / se
  • p-value = probability of the sample result, t, or a result more extreme,
    • assuming the null hypothesis is true
  • Small p-value, reject \(H_0\)
  • Large p-value, accept \(H_0\)
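
As a small illustration, a two-sided p-value can be computed directly from a test statistic in R. The statistic and degrees of freedom below are made up, and the t distribution is assumed to be the right reference distribution.

```r
# Hypothetical test statistic and degrees of freedom (made up)
t_stat   <- 2.1
deg_free <- 98

# Two-sided p-value: probability of a t this far from zero, in either
# direction, if the null hypothesis is true
p_value <- 2 * pt(-abs(t_stat), df = deg_free)
p_value
```

The `-abs()` takes the lower tail, and doubling it counts results that are equally extreme in the opposite direction.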

Alternate interpretations

  • Consistency between the data and the null
    • Small value, inconsistent
    • Large value, consistent
  • Evidence against the null
    • Small, lots of evidence against the null
    • Large, little evidence against the null

What the p-value is not (1/2)

  • A p-value is NOT the probability that the null hypothesis is true.
    • P[t or more extreme | null] is different from
    • P[null | t or more extreme]
      • P[null] is nonsensical
      • \(\mu\), \(\pi\), or \(\rho\) are unknown constants (no sampling error)

What the p-value is not (2/2)

  • Not a measure FOR either hypothesis
    • Little evidence against the null \(\ne\) lots of evidence for the null
  • Not very informative if it is large
    • Need a power calculation, or
    • Narrow confidence interval
  • Not very helpful for huge data sets

A research paper computes a p-value of 0.45. How would you interpret this p-value?

  1. Strong evidence for the null
  2. Strong evidence for the alternative
  3. Little or no evidence for the null
  4. Little or no evidence for the alternative
  5. More than one answer above is correct.
  6. I do not know the answer.

Figure 1: xkcd cartoon about jelly beans and cancer

What is p-hacking?

  • Abuse of the hypothesis testing framework.
    • Run multiple tests on the same outcome
    • Test multiple outcome measures
    • Remove outliers and retest
  • Defenses against p-hacking
    • Bonferroni
    • Primary versus secondary
    • Published protocol
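
The Bonferroni adjustment is available in base R through `p.adjust()`. The four p-values below are hypothetical, standing in for tests on four outcome measures.

```r
# Hypothetical p-values from four outcome measures (made up)
p_raw <- c(0.01, 0.04, 0.03, 0.20)

# Bonferroni: multiply each p-value by the number of tests, capping at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf

# Which results remain below 0.05 after adjustment?
p_bonf < 0.05
```

Three of the raw p-values fall below 0.05, but only the smallest survives the adjustment.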

Break #5

  • What you have learned
    • Tests of hypothesis
  • What’s coming next
    • Confidence intervals

What is a confidence interval?

  • Range of plausible values
    • Tries to quantify uncertainty associated with the sampling process.
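
A minimal sketch of a 95% confidence interval for a mean in R. The measurements are simulated, not taken from any study, and the by-hand version assumes the usual t-based interval.

```r
set.seed(5501)  # arbitrary seed so the sketch is reproducible

# Simulated measurements (made-up data)
measurements <- rnorm(24, mean = 1, sd = 15)

# 95% confidence interval for the mean
ci <- t.test(measurements, conf.level = 0.95)$conf.int
ci

# The same interval by hand: estimate +/- t quantile * standard error
n  <- length(measurements)
se <- sd(measurements) / sqrt(n)
by_hand <- mean(measurements) + c(-1, 1) * qt(0.975, df = n - 1) * se
by_hand
```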

Example of a confidence interval

  • Homeopathic treatment of swelling after oral surgery
    • 95% CI: -5.5 to 7.5 mm
    • Lokken P, Straumsheim PA, Tveiten D, Skjelbred P, Borchgrevink CF. Effect of homoeopathy on pain and other events after acute trauma: placebo controlled trial with bilateral oral surgery. BMJ. 1995;310(6992):1439-1442.

Confidence interval interpretation (1 of 7)

Figure 2: Interval that contains the null value

Confidence interval interpretation (2 of 7)

Figure 3: Interval entirely above the null value

Confidence interval interpretation (3 of 7)

Figure 4: Interval entirely below the null value

Confidence interval interpretation (4 of 7)

Figure 5: Interval entirely inside the range of clinical indifference

Confidence interval interpretation (5 of 7)

Figure 6: Interval partly inside/outside range of clinical indifference

Quiz question, revisited

A research paper computes a confidence interval for a relative risk of 0.82 to 3.94. This confidence interval tells you that the result is

  1. statistically significant and clinically important.
  2. not statistically significant, but is clinically important.
  3. statistically significant, but not clinically important.
  4. not statistically significant, and not clinically important.
  5. The result is ambiguous.
  6. I do not know the answer.

Confidence interval interpretation (6 of 7)

Figure 7: Confidence interval entirely inside the range of clinical indifference

Confidence interval interpretation (7 of 7)

Figure 8: Confidence interval entirely outside the range of clinical indifference

Why you might prefer a confidence interval

  • Provides the same information as a p-value, plus
    • Clinical importance
    • Distinguish between
      • definitive negative result, or
      • more research is needed

Break #6

  • What you have learned
    • Confidence intervals
  • What’s coming next
    • A simple R program

Grading rubric

simon-5501-01-template.qmd, part 1

---
title: "Template for 5501-01 programming assignment"
author: "Steve Simon"
format: 
  html:
    embed-resources: true
date: 2024-08-18
---

This program reads data on housing prices in Albuquerque, New Mexico in 1993. Find more information in the [data dictionary][dd].

[dd]: https://github.com/pmean/datasets/blob/master/albuquerque-housing.yaml

This code is placed in the public domain.

simon-5501-01-template.qmd, part 2

## Load the tidyverse library

For most of your programs, you should load the tidyverse library. The messages and warnings are suppressed.

```{r setup}
#| message: false
#| warning: false
library(tidyverse)
```

simon-5501-01-template.qmd, part 3

## Read the data and view a brief summary

Use the read_csv function to read the data. The glimpse function will produce a brief summary.

```{r read}
alb <- read_csv(
  file="../data/albuquerque-housing.csv",
  col_names=TRUE,
  col_types="nnnnccc",
  na=".")
glimpse(alb)
```

simon-5501-01-template.qmd, part 4

## Calculate overall means

The summarise_if function produces means, but only for numeric data. You wouldn't want to compute means for data with values "yes" and "no".

```{r means}
alb |>
  summarise_if(is.numeric, mean, na.rm = TRUE)
```
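
In dplyr 1.0.0 and later, the scoped verbs such as `summarise_if()` are superseded by `across()`. They still work, but an equivalent call might look like the sketch below. The data frame `toy` and its column names are made up so the example runs without the housing file.

```r
library(dplyr)

# Small made-up data frame standing in for the housing data
toy <- tibble(
  price = c(100, 120, NA),
  sqft  = c(1500, 1800, 2100),
  cor   = c("yes", "no", "no"))

# across(where(is.numeric), ...) targets the same columns as summarise_if()
toy_means <- toy |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
toy_means
```

The character column is skipped, and `na.rm = TRUE` drops the missing price before averaging.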

simon-5501-01-template.qmd, part 5

## Summarize price

The average price of a home, 106 thousand dollars, is quite low because the data comes from 1993.

## Summarize sqft

## Summarize age

## Summarize features

Summary

  • What you have learned
    • About this class
    • R and RStudio
    • History of R
    • Scales of measurement
    • Tests of hypothesis
    • Confidence intervals
    • A simple R program